US Health Care Data Breaches¶

Forecasting US Healthcare Data Breaches¶

DSA610 Buffalo State University Spring 2023¶

Steve Lucas¶

Project Data Science Lifecycle¶

Strategy & Predictive Goals¶

Following the Project Data Science Lifecycle as a process structure, this project utilizes the Health & Human Services Data Breach Portal data as its central dataset for extrapolating meaningful predictions and findings about healthcare-related data breaches in the US. In particular, using time series forecasting, it attempts to predict the number of breaches, the breach types, and the locations of breaches. This will involve training and evaluating multiple time series models, which will likely operate at monthly frequencies.
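The monthly frequency mentioned above implies aggregating individual breach submissions into monthly counts. A minimal sketch of that resampling (the column name follows the HHS export, but the dates here are synthetic stand-ins):

```python
import pandas as pd

# Synthetic stand-in for the HHS export: one row per reported breach
df = pd.DataFrame({
    "Breach Submission Date": pd.to_datetime(
        ["01/15/2021", "01/20/2021", "02/03/2021", "02/10/2021", "02/28/2021"],
        format="%m/%d/%Y",
    )
})

# Count breaches per calendar month -- the frequency the forecasting models will use
monthly = (
    df.set_index("Breach Submission Date")
      .resample("MS")  # month-start bins
      .size()
      .rename("Breach Count")
)
print(monthly)
```

The resulting series has one row per month and can be fed directly to the statsmodels and pmdarima models imported below.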

Libraries, Data, & Configuration¶

In [ ]:
import math
import pandas as pd, numpy as np, plotly.express as px, plotly.graph_objs as go, plotly.figure_factory as ff, plotly.io as pio
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
import pmdarima as pm
from sklearn.metrics import mean_absolute_error, mean_squared_error
pio.renderers.default = "plotly_mimetype+notebook"

https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf — select Archive, then Research Report, and then the Excel or CSV export to get the raw data. A direct download link cannot be used because the portal obfuscates its URLs with server-side session state and client-side JavaScript.

Latest Data Retrieval 5/5/2023

Because we are concerned with time series analysis, the data should be read in, sorted chronologically, and eventually indexed via time. We will extract the year separately for some visual analysis.

In [ ]:
df = pd.read_csv("breach_report.csv")
df['Breach Submission Date'] = pd.to_datetime(df['Breach Submission Date'], format='%m/%d/%Y')
df['Year of Breach Submission'] = df['Breach Submission Date'].dt.year
df.sort_values(by='Breach Submission Date', ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
In [ ]:
df.head()
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission
0 Brooke Army Medical Center TX Healthcare Provider 1000.0 2009-10-21 Theft Paper/Films No A binder containing the protected health infor... 2009
1 Mid America Kidney Stone Association, LLC MO Healthcare Provider 1000.0 2009-10-28 Theft Network Server No Five desktop computers containing unencrypted ... 2009
2 Alaska Department of Health and Social Services AK Healthcare Provider 501.0 2009-10-30 Theft Other, Other Portable Electronic Device No The Alaska Department of Health and Social Ser... 2009
3 Health Services for Children with Special Need... DC Health Plan 3800.0 2009-11-17 Loss Laptop No A laptop was lost by an employee while in tran... 2009
4 Michele Del Vicario, MD CA Healthcare Provider 6145.0 2009-11-20 Theft Desktop Computer No A shared Computer that was used for backup was... 2009

Dataset Analysis¶

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5356 entries, 0 to 5355
Data columns (total 10 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Name of Covered Entity            5356 non-null   object        
 1   State                             5341 non-null   object        
 2   Covered Entity Type               5352 non-null   object        
 3   Individuals Affected              5355 non-null   float64       
 4   Breach Submission Date            5356 non-null   datetime64[ns]
 5   Type of Breach                    5355 non-null   object        
 6   Location of Breached Information  5356 non-null   object        
 7   Business Associate Present        5356 non-null   object        
 8   Web Description                   4250 non-null   object        
 9   Year of Breach Submission         5356 non-null   int32         
dtypes: datetime64[ns](1), float64(1), int32(1), object(7)
memory usage: 397.6+ KB

The dataset contains records for 5356 healthcare data breaches and has one identifier variable (Name of Covered Entity), 4 categorical (Covered Entity Type, Type of Breach, Location of Breached Information, and State), 1 boolean (currently a string: Business Associate Present), 1 date (Breach Submission Date), 1 text (Web Description), and 1 quantitative (Individuals Affected).

In [ ]:
df.isnull().sum()
Out[ ]:
Name of Covered Entity                 0
State                                 15
Covered Entity Type                    4
Individuals Affected                   1
Breach Submission Date                 0
Type of Breach                         1
Location of Breached Information       0
Business Associate Present             0
Web Description                     1106
Year of Breach Submission              0
dtype: int64

The dataset does not appear to be missing much data at first glance, although Web Descriptions are missing for a nontrivial share of breach observations. This may be in part because some breaches were never publicly announced via a website. The Web Descriptions may provide opportunities for feature creation, as they can contain valuable information, and there is also the possibility of incorporating them via sentiment analysis. Caution is warranted, though: we want to utilize most of the 5356 records, and losing roughly 20% to missing web descriptions (a natural occurrence for some breaches) is not desirable. If they are used, the absence of a description should ideally serve as a meaningful data point for a breach observation in its own right, although this is unlikely.
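The roughly-20% figure follows directly from the null counts above (1106 of 5356). A small self-contained sketch of the same check on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the breach data: some Web Descriptions missing
toy = pd.DataFrame({
    "Web Description": ["A binder containing PHI...", np.nan,
                        "Five desktop computers...", np.nan,
                        "Network server accessed..."]
})

# Fraction of records lacking a description; on the real data this is 1106/5356, about 20.6%
missing_share = toy["Web Description"].isna().mean()
print(f"{missing_share:.0%} of toy records missing a Web Description")
```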

In [ ]:
df['State'].unique()
Out[ ]:
array(['TX', 'MO', 'AK', 'DC', 'CA', 'PA', 'TN', 'VA', 'NC', 'MI', 'MA',
       'MD', 'ID', 'IL', 'UT', 'AZ', 'RI', 'PR', 'FL', 'NM', 'NY', 'WY',
       'WI', 'WA', 'CT', 'AL', 'GA', 'SC', 'KY', 'MN', 'CO', 'NE', 'KS',
       'OH', 'NV', 'IN', 'OR', 'IA', 'OK', 'AR', 'MS', 'NH', 'MT', 'NJ',
       'WV', nan, 'LA', 'ND', 'HI', 'SD', 'ME', 'VT', 'DE'], dtype=object)

The dataset includes Puerto Rico and DC as states for convenience. Records with NaN states should be addressed in the data preparation phase. Given the high dimensionality that encoding 50+ state values would introduce, State might be useful for visual analysis, but typically not for machine learning unless geospatial analysis is central.
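To illustrate the dimensionality concern, one-hot encoding the State column adds one indicator column per unique value, so the full column would expand into 52 new features. A sketch with `pd.get_dummies` on a few toy values:

```python
import pandas as pd

# Toy sample of the State column; the real data holds 52 values (50 states plus DC and PR)
states = pd.DataFrame({"State": ["TX", "MO", "AK", "TX"]})

# One-hot encoding adds one indicator column per unique value
encoded = pd.get_dummies(states, columns=["State"], prefix="State")
print(encoded.columns.tolist())  # ['State_AK', 'State_MO', 'State_TX']
print(encoded.shape)             # (4, 3)
```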

In [ ]:
df['Covered Entity Type'].value_counts()
Out[ ]:
Covered Entity Type
Healthcare Provider          3890
Business Associate            766
Health Plan                   686
Healthcare Clearing House      10
Name: count, dtype: int64

Healthcare sector organizations fall under four types of classification under the HITECH Act of 2009, alongside HHS's treatment of the sector. Healthcare Providers are entities that directly provide healthcare to patients or individuals. Business Associates are other entities that fall under specific HIPAA and HHS guidelines (https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/business-associates/index.html). For the purpose of this report, they will be referred to as third party vendors to better differentiate them.

These will likely be encoded for potential machine learning models and are important for understanding the healthcare data breach problem. Observing which entities are breached, and which entities affect the most individuals when breached, is paramount.

In [ ]:
df['Type of Breach'].unique()
Out[ ]:
array(['Theft', 'Loss', 'Other', 'Unauthorized Access/Disclosure',
       'Hacking/IT Incident', nan, 'Loss, Theft', 'Improper Disposal',
       'Improper Disposal, Loss', 'Other, Theft', 'Loss, Other',
       'Improper Disposal, Loss, Theft', 'Unknown',
       'Theft, Unauthorized Access/Disclosure',
       'Hacking/IT Incident, Unauthorized Access/Disclosure',
       'Other, Unauthorized Access/Disclosure',
       'Hacking/IT Incident, Other', 'Other, Unknown',
       'Loss, Unauthorized Access/Disclosure, Unknown',
       'Loss, Theft, Unauthorized Access/Disclosure, Unknown',
       'Hacking/IT Incident, Other, Unauthorized Access/Disclosure',
       'Hacking/IT Incident, Theft, Unauthorized Access/Disclosure',
       'Improper Disposal, Theft', 'Hacking/IT Incident, Theft',
       'Hacking/IT Incident, Improper Disposal, Loss, Other, Theft, Unauthorized Access/Disclosure, Unknown',
       'Loss, Other, Theft',
       'Other, Theft, Unauthorized Access/Disclosure',
       'Improper Disposal, Theft, Unauthorized Access/Disclosure',
       'Loss, Unknown', 'Loss, Unauthorized Access/Disclosure',
       'Improper Disposal, Unauthorized Access/Disclosure'], dtype=object)
In [ ]:
df['Type of Breach'].value_counts()
Out[ ]:
Type of Breach
Hacking/IT Incident                                                                                    2599
Unauthorized Access/Disclosure                                                                         1293
Theft                                                                                                   979
Loss                                                                                                    205
Improper Disposal                                                                                       108
Other                                                                                                    75
Theft, Unauthorized Access/Disclosure                                                                    25
Loss, Theft                                                                                              15
Unknown                                                                                                  10
Hacking/IT Incident, Unauthorized Access/Disclosure                                                       8
Other, Unauthorized Access/Disclosure                                                                     6
Other, Theft                                                                                              3
Improper Disposal, Loss, Theft                                                                            3
Loss, Unauthorized Access/Disclosure                                                                      3
Improper Disposal, Loss                                                                                   3
Hacking/IT Incident, Theft, Unauthorized Access/Disclosure                                                2
Other, Theft, Unauthorized Access/Disclosure                                                              2
Hacking/IT Incident, Other                                                                                2
Other, Unknown                                                                                            2
Loss, Other                                                                                               2
Loss, Theft, Unauthorized Access/Disclosure, Unknown                                                      1
Hacking/IT Incident, Other, Unauthorized Access/Disclosure                                                1
Loss, Unauthorized Access/Disclosure, Unknown                                                             1
Improper Disposal, Theft                                                                                  1
Hacking/IT Incident, Theft                                                                                1
Hacking/IT Incident, Improper Disposal, Loss, Other, Theft, Unauthorized Access/Disclosure, Unknown       1
Loss, Other, Theft                                                                                        1
Improper Disposal, Theft, Unauthorized Access/Disclosure                                                  1
Loss, Unknown                                                                                             1
Improper Disposal, Unauthorized Access/Disclosure                                                         1
Name: count, dtype: int64
In [ ]:
len(df['Type of Breach'].unique())
Out[ ]:
31

Type of Breach and Location of Breached Information are interesting because multiple categorical labels are applied to a small proportion of breaches. It's safe to assume this is because those breaches involved multiple acts leading to the disclosure of data. After counting the unique combinations, there are too many to treat as distinct categories. Because the independence or dependence of breach type combinations would be difficult to aggregate into their own categories accurately and honestly, it is perhaps best to simply create new First Breach Type and First Location of Breach columns to reduce the number of labels to a more manageable number.

Dataset Preparation¶

Dataset analysis resulted in some obvious actions to take:

  1. Examine and treat the small number of missing values
  2. Generate First Breach Type and First Location of Breach columns to avoid multi-label values.
  3. Create a cost-per-record feature based on Table 7 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349636/) and the IBM Cost of a Data Breach reports (https://www.ibm.com/downloads/cas/3R8N1DZJ), with the last few years imputed using Exponential Smoothing.
In [ ]:
# Simply use "Not Provided" for records without web descriptions.
df['Web Description'].fillna("Not Provided", inplace=True)
In [ ]:
df[df['Type of Breach'].isnull()]
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission
54 Computer Program and Systems, Inc. (CPSI) AL Business Associate 768.0 2010-03-30 NaN Email Yes \N 2010

For the one null Type of Breach value, we are given the location of the breached information, but not the type. Email suggests it was at least a Hacking/IT Incident or Unauthorized Access/Disclosure. Because it was a third party vendor, and a third party associate was directly involved, it was likely an unauthorized disclosure. There is also a reference to this breach being an "Unauthorized Access/Disclosure" at https://search.r-project.org/CRAN/refmans/Ecdat/html/HHSCyberSecurityBreaches.html.

In [ ]:
df.at[54, "Type of Breach"] = "Unauthorized Access/Disclosure"
In [ ]:
df[df['Individuals Affected'].isnull()]
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission
802 Valperaiso Fire Department IN Health Plan NaN 2013-09-03 Theft Desktop Computer No This case has been consolidated with another r... 2013
In [ ]:
df[df['Individuals Affected'].isnull()]['Web Description']
Out[ ]:
802    This case has been consolidated with another r...
Name: Web Description, dtype: object

This breach appears to be part of a larger series of breaches across multiple states:

https://www.healthdatamanagement.com/articles/tax-fraud-fueled-breach-hits-36th-fire-department

The article does mention that 860 individuals were affected.

In [ ]:
df.at[802, "Individuals Affected"] = 860
In [ ]:
df[df['Covered Entity Type'].isnull()]
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission
967 HealthSource of Ohio Inc. OH NaN 8845.0 2014-02-26 Unauthorized Access/Disclosure Other Yes Health Source of Ohio, the covered entity (CE)... 2014
1042 Boston Medical Center MA NaN 15265.0 2014-04-29 Unauthorized Access/Disclosure Network Server Yes Boston Medical Center, the covered entity (CE)... 2014
1097 BioReference Laboratories Inc NJ NaN 3334.0 2014-07-23 Unauthorized Access/Disclosure Network Server Yes The covered entity (CE), BioReference Laborato... 2014
2310 Boys Town National Research Hospital NE NaN 2182.0 2018-05-09 Hacking/IT Incident Email Yes The covered entity (CE), Boys Town National Re... 2018

It is unclear why there are nulls for these covered entities. For simplicity's sake, all 4 will be classified based on their own websites.

In [ ]:
df.at[2310, "Covered Entity Type"] = "Healthcare Provider" # Hospital
df.at[1097, "Covered Entity Type"] = "Healthcare Provider" # Patient Testing Facility
df.at[1042, "Covered Entity Type"] = "Healthcare Provider" # Hospital/Research Center
df.at[967, "Covered Entity Type"] = "Healthcare Provider" # Multiple Healthcare Services Provider
In [ ]:
df[df['State'].isnull()]
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission
243 Departamento de Salud de Puerto Rico NaN Healthcare Provider 2621.0 2011-02-22 Loss Other Portable Electronic Device No The covered entity (CE) reported that on March... 2011
1315 Puerto Rico Department of Heatlh - Medicaid Pr... NaN Health Plan 500.0 2015-04-22 Theft Other No The covered entity (CE) reported that on Febru... 2015
2208 Triple-S Advantage, Inc. NaN Health Plan 36305.0 2018-02-02 Unauthorized Access/Disclosure Paper/Films No Subsequent to Triple-S Management entering int... 2018
2675 Metro Santurce, Inc. d/b/a Hospital Pavia Sant... NaN Healthcare Provider 305737.0 2019-04-13 Hacking/IT Incident Network Server No The covered entities (CEs), Metro Santurce Inc... 2019
2706 Inspira Behavioral Care, Corp NaN Healthcare Provider 4246.0 2019-05-02 Theft Desktop Computer No Inspira Behavioral Care, Corp., the covered en... 2019
2716 Inmediata Health Group, Corp. NaN Healthcare Clearing House 1565338.0 2019-05-07 Unauthorized Access/Disclosure Network Server No Not Provided 2019
2746 Farmacia La Amistad Inc. NaN Healthcare Provider 2500.0 2019-05-24 Hacking/IT Incident Network Server No Farmacia La Amistad, the covered entity (CE), ... 2019
2824 Bayamon Medical Center Corp. NaN Healthcare Provider 422496.0 2019-07-19 Hacking/IT Incident Network Server No Bayamon Medical Center, the covered entity (CE... 2019
2826 Puerto Rico Women And Children's Hospital, LLC NaN Healthcare Provider 99943.0 2019-07-19 Hacking/IT Incident Network Server No Puerto Rico Women and Children’s Hospital, the... 2019
2916 Intramural Practice Plan - Medical Sciences Ca... NaN Healthcare Provider 439753.0 2019-09-16 Hacking/IT Incident Network Server No Intramural Practice Plan - Medical Sciences Ca... 2019
3422 Servicios de Salud Primarios de Barceloneta, I... NaN Healthcare Provider 60200.0 2020-09-04 Hacking/IT Incident Email No The covered entity (CE), Servicios de Salud Pr... 2020
4550 Laboratorio Clínico Caparros NaN Healthcare Provider 500.0 2022-03-06 Hacking/IT Incident Network Server No The covered entity (CE), Laboratorio Clínico C... 2022
4563 Laboratorio Clinico Toledo NaN Healthcare Provider 500.0 2022-03-14 Hacking/IT Incident Network Server No Not Provided 2022
5072 Doctors' Center Hospital NaN Healthcare Provider 1195220.0 2022-11-09 Hacking/IT Incident Network Server No Not Provided 2022
5345 Modern Cardiology Associates NaN Healthcare Provider 10000.0 2023-04-19 Hacking/IT Incident Network Server No Not Provided 2023

All records with missing states appear to be from Puerto Rico except for Inspira, which is in New Jersey.

In [ ]:
df.at[2706, "State"] = "NJ"
In [ ]:
df['State'].fillna("PR", inplace=True)
In [ ]:
df.isnull().sum()
Out[ ]:
Name of Covered Entity              0
State                               0
Covered Entity Type                 0
Individuals Affected                0
Breach Submission Date              0
Type of Breach                      0
Location of Breached Information    0
Business Associate Present          0
Web Description                     0
Year of Breach Submission           0
dtype: int64

There are now no null values in the dataset.

Creating a First Breach Type column is a simple way of dealing with the multi-value label problem for Type of Breach.

In [ ]:
# split column by ',' and keep only the first column
df['First Breach Type'] = df['Type of Breach'].str.split(',').str[0]
df['First Location of Breach'] = df['Location of Breached Information'].str.split(',').str[0]

From IBM's 2021 report, we had 499 USD for 2020. For 2009-2019, we were able to use the consolidated table from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349636/. We still needed values for 2021-2023. For some reason, the 2022 IBM Cost of a Data Breach report only gave the average cost of a data record across all sectors, but healthcare records have always been more valuable than the average record. This may be due in part to their research disclaimer about the strength of the US dollar, which makes that year's value questionable:

"This year, a strong U.S. dollar significantly influenced the
global cost analysis. The conversion from local currencies
to the U.S. dollar deflated the per record and average total
cost estimates. For purposes of consistency with prior
years, we decided to continue to use the same accounting
method rather than adjust the cost" (https://www.ibm.com/downloads/cas/OJDVQGRY).

Measuring the actual value outside of a currency is difficult, but the trend is obvious: healthcare record cost values continue to rise for organizations. Please note that these average per-record costs for a healthcare organization are themselves estimations by IBM and the Ponemon Institute. Organizations tend to downplay the costs of a data breach and typically do not include indirect costs, meaning these values could be underestimates.

Using the table, the values for 2021-2023 were imputed via Exponential Smoothing Forecasting due to the small number of values.

In [ ]:
healthcare_record_costs = {
    2009: 270,
    2010: 294,
    2011: 240,
    2012: 233,
    2013: 296,
    2014: 359,
    2015: 363,
    2016: 355,
    2017: 380,
    2018: 408,
    2019: 429,
    2020: 499
}

costs_df = pd.DataFrame(list(healthcare_record_costs.items()), columns=["Year", "Cost"])
costs_df.set_index("Year", inplace=True)

model = ExponentialSmoothing(costs_df, seasonal=None, trend='additive', damped_trend=True)
model_fit = model.fit()

n_missing_years = 3
forecast = model_fit.forecast(n_missing_years)

for i in range(n_missing_years):
    year = 2021 + i
    value = forecast.iloc[i]
    healthcare_record_costs[year] = value

print(healthcare_record_costs)
{2009: 270, 2010: 294, 2011: 240, 2012: 233, 2013: 296, 2014: 359, 2015: 363, 2016: 355, 2017: 380, 2018: 408, 2019: 429, 2020: 499, 2021: 518.9987039305105, 2022: 538.8897530228002, 2023: 558.6645878704182}
d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning:

An unsupported index was provided and will be ignored when e.g. forecasting.

d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: ValueWarning:

No supported index is available. Prediction results will be given with an integer index beginning at `start`.

d:\Python\Lib\site-packages\statsmodels\tsa\base\tsa_model.py:836: FutureWarning:

No supported index is available. In the next version, calling this method in a model without a supported index will result in an exception.

In [ ]:
# floor the floats down to be consistent with the other estimated values
healthcare_record_costs = {year: math.floor(value) for year, value in healthcare_record_costs.items()}
In [ ]:
df['Year of Breach Submission']
Out[ ]:
0       2009
1       2009
2       2009
3       2009
4       2009
        ... 
5351    2023
5352    2023
5353    2023
5354    2023
5355    2023
Name: Year of Breach Submission, Length: 5356, dtype: int32
In [ ]:
# Create average cost per healthcare record per year
df['Average Cost Per Record Per Breach Year'] = df['Year of Breach Submission'].apply(lambda x: healthcare_record_costs[x])
In [ ]:
df.head()
Out[ ]:
Name of Covered Entity State Covered Entity Type Individuals Affected Breach Submission Date Type of Breach Location of Breached Information Business Associate Present Web Description Year of Breach Submission First Breach Type First Location of Breach Average Cost Per Record Per Breach Year
0 Brooke Army Medical Center TX Healthcare Provider 1000.0 2009-10-21 Theft Paper/Films No A binder containing the protected health infor... 2009 Theft Paper/Films 270
1 Mid America Kidney Stone Association, LLC MO Healthcare Provider 1000.0 2009-10-28 Theft Network Server No Five desktop computers containing unencrypted ... 2009 Theft Network Server 270
2 Alaska Department of Health and Social Services AK Healthcare Provider 501.0 2009-10-30 Theft Other, Other Portable Electronic Device No The Alaska Department of Health and Social Ser... 2009 Theft Other 270
3 Health Services for Children with Special Need... DC Health Plan 3800.0 2009-11-17 Loss Laptop No A laptop was lost by an employee while in tran... 2009 Loss Laptop 270
4 Michele Del Vicario, MD CA Healthcare Provider 6145.0 2009-11-20 Theft Desktop Computer No A shared Computer that was used for backup was... 2009 Theft Desktop Computer 270

We can now estimate each breach's total cost based on Average Cost Per Record Per Breach Year and Individuals Affected, with a scaling factor to reflect that per-record costs decrease when breaches are very large.

In [ ]:
df["Estimated Total Breach Cost"] = df['Average Cost Per Record Per Breach Year']  * (df['Individuals Affected'] ** 0.9)
In [ ]:
df['Estimated Total Breach Cost'].mean()
Out[ ]:
7984134.376931305

This mean total healthcare breach cost, derived from per-record costs, approaches many of the annual estimates in IBM's Cost of a Data Breach reports, but it introduces a major amount of bias and uncertainty into any further analysis. Given that companies are not forthcoming, nor even motivated, to produce accurate figures when reporting breaches, this may be one of the only ways to estimate cost short of a comprehensive study of quarterly expense reports for the organizations that make this information publicly available.
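The 0.9 exponent used above makes the effective per-record cost shrink as breach size grows. A quick illustrative check (the cost figure here is hypothetical, not drawn from the dataset):

```python
# Effective per-record cost under the total = cost_per_record * affected**0.9 estimator
cost_per_record = 400  # hypothetical annual average cost per healthcare record

for affected in (1_000, 1_000_000):
    total = cost_per_record * affected ** 0.9
    # larger breaches imply a lower effective per-record cost
    print(f"{affected:>9,} records -> ${total / affected:.2f} effective per-record cost")
```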

Exploratory Analysis¶

A presumption is that the most frequent breach type in healthcare would have changed over the last 15 years due to the increased reliance on technology and the rise of digital hacking. Healthcare records were known to be valuable even before the rise of the dark web, and would often be physically burglarized in order to be resold on black markets. Incentives were included in the HITECH Act of 2009 to encourage the use of electronic medical record and health systems that would help digitize operations and reduce the complex management of paper/film records.

In [ ]:
# Count the number of breaches per year, grouped by type
breach_counts = df.groupby(['Year of Breach Submission', 'First Breach Type']).size().reset_index(name='Count')

breach_counts = breach_counts[(breach_counts['Year of Breach Submission'] != 2009) & (breach_counts['Year of Breach Submission'] != 2023) & (breach_counts['First Breach Type'] != 'Unknown')]
# Calculate the total counts per year
total_counts = breach_counts.groupby('Year of Breach Submission')['Count'].sum().reset_index(name='Total')

# Merge breach_counts with total_counts to calculate proportions
breach_counts = breach_counts.merge(total_counts, on='Year of Breach Submission')
breach_counts['Proportion'] = (breach_counts['Count'] / breach_counts['Total']) * 100

# Create a list of formatted percentage strings
formatted_percentages = [f"{round(prop, 1)}%" for prop in breach_counts['Proportion']]

# Create a stacked bar chart with proportions as inner bar text labels
fig = px.bar(breach_counts, x='Year of Breach Submission', y='Count', barmode='stack', color='First Breach Type', text=formatted_percentages)

# Iterate through the grouped DataFrame and add total count annotations above the bars
for year, grp in breach_counts.groupby('Year of Breach Submission'):
    total_count = grp['Count'].sum()
    fig.add_annotation(
        x=year,
        y=total_count,
        text=str(total_count),
        showarrow=False,
        font=dict(size=12),
        yshift=5
    )

# Set auto-resizing
fig.update_layout(
    title=dict(
        text='Hackers & Healthcare<br><sub>Number of Breaches per Year by Breach Type (2010-2022)</sub>',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=30, b=0),
    width=None,
    height=None
)

# Get total counts per year
total_counts = breach_counts.groupby('Year of Breach Submission')['Count'].sum().reset_index(name='Total')

# Add labels for total counts above each bar
annotations = []
for index, row in total_counts.iterrows():
    annotations.append(go.layout.Annotation(
        x=row['Year of Breach Submission'],
        y=row['Total'] + 15,
        text=str(row['Total']),
        showarrow=False,
        font=dict(size=8, color='black')
    ))
fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)
fig.update_layout(annotations=annotations)

fig.show()

Several trends can be deduced from the annual stacked bar chart above. First, breaches are occurring more frequently each year. Second, cybersecurity incidents have accounted for the majority of data breaches since 2017; this trend will likely continue for years to come and is probably not very surprising. What is distinct about healthcare breaches, though, is that these records are valuable enough to physically burglarize: from 2010 to 2014, physical theft accounted for the majority of breaches, owing to the high price medical records would fetch on black markets.

In [ ]:
def format_million(value):
    return f"{round(value / 1_000_000, 1)}M"

# Calculate the total number of individuals affected per year, grouped by breach type
affected_counts = df.groupby(['Year of Breach Submission', 'First Breach Type'])['Individuals Affected'].sum().reset_index(name='Affected')

affected_counts = affected_counts[(affected_counts['Year of Breach Submission'] != 2009) & (affected_counts['Year of Breach Submission'] != 2023) & (affected_counts['First Breach Type'] != 'Unknown')]

# Create a stacked bar chart
fig = px.bar(affected_counts, x='Year of Breach Submission', y='Affected', barmode='stack', color='First Breach Type', title='The Hacked & Affected: Individuals Affected per Year by Type (2010-2022)')

# Iterate through the grouped DataFrame and add total affected labels above the bars
for year, grp in affected_counts.groupby('Year of Breach Submission'):
    total_count = grp['Affected'].sum()
    fig.add_annotation(
        x=year,
        y=total_count,
        text=format_million(total_count),
        showarrow=False,
        font=dict(size=12),
        yshift=5
    )

# Add title & subtitle
fig.update_layout(
    title=dict(
        text='The Hacked & Affected<br><sub>Individuals Affected per Year by Type (2010-2022)</sub>',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=60, b=0),
    width=None,
    height=None
)

fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)

fig.show()

When broken down into cumulative records lost in healthcare breaches by type, cybersecurity incidents again constitute the vast majority of records exposed annually, where physical theft once did. This is not surprising, as digitally stealing millions of records is easier than carrying millions of records physically. It's also apparent that bad actors (state sponsored and otherwise) caught on to this being the more effective and safer way of stealing the same records they would previously attempt to burglarize. Why risk several felonies stateside when you can steal the same data from a location where you will never be charged?

2015 also holds the landmark Anthem data breach, the largest healthcare data breach on record, with about 80 million PII records stolen, including social security numbers, health identification numbers, names, and birthdays. Anthem's total settlement and punitive fees to Indiana's state attorney general, Health and Human Services, and class action lawsuits filed by victims totaled around 180 million US dollars by 2019. The breach was likely sponsored by Chinese intelligence groups, who have supposedly been hoarding the data, as no acts of fraud have been directly linked to the leak. But this was the breach that convinced hacker groups around the world that this is the way to steal the identities of American citizens. Complete medical records can fetch upwards of $1,000 each in some cases, per a study by Experian on dark web sales (https://www.experian.com/blogs/ask-experian/heres-how-much-your-personal-information-is-selling-for-on-the-dark-web/). Note that this is the value of a record on the black market, not the cost to an organization of exposing a record.

Anthem's hack in some ways possibly lowered the average cost per healthcare record.

In [ ]:
# Filter out years 2009 and 2023 again
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]

# Create time series plot
fig = px.line(filtered_df, x='Year of Breach Submission', y='Average Cost Per Record Per Breach Year', title='Average Cost Per Record per Year', markers=True)

fig.update_layout(
    title=dict(
        text='Average Cost Per Record Annually (2010-2022)',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=60, b=0),
    width=None,
    height=None
)

fig.update_traces(textfont=dict(size=5.5))
fig.update_yaxes(automargin=True)

fig.show()

Anthem's effect on the data as an outlier can be observed more drastically in the estimated total breach cost.

In [ ]:
# Filter out years 2009 and 2023 again
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]

# get average estimated total breach cost per year
average_cost_per_year = filtered_df.groupby('Year of Breach Submission')['Estimated Total Breach Cost'].mean().reset_index()

# Create a time series plot
fig = px.line(average_cost_per_year, x='Year of Breach Submission', y='Estimated Total Breach Cost', title='Average Cost of a Breach per Year', markers=True)

fig.update_layout(
    title=dict(
        text='Average Cost of a Breach Annually (2010-2022)',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=60, b=0),
    width=None,
    height=None
)

fig.show()

In many ways, Anthem's breach shows the extent to which the estimator for total breach cost based on average cost per record is inadequate, particularly in the presence of outliers.

In [ ]:
df.iloc[1253]
Out[ ]:
Name of Covered Entity                                                           Anthem Inc.
State                                                                                     IN
Covered Entity Type                                                              Health Plan
Individuals Affected                                                              78800000.0
Breach Submission Date                                                   2015-02-13 00:00:00
Type of Breach                                                           Hacking/IT Incident
Location of Breached Information                                              Network Server
Business Associate Present                                                                No
Web Description                            Anthem, Inc. has agreed to pay $16 million to ...
Year of Breach Submission                                                               2015
First Breach Type                                                        Hacking/IT Incident
First Location of Breach                                                      Network Server
Average Cost Per Record Per Breach Year                                                  363
Estimated Total Breach Cost                                                4642802619.242887
Name: 1253, dtype: object

While Anthem suffered severe costs rarely observed in any data breach, healthcare or otherwise, the breach did not cost the company 46 billion dollars. Many breach studies remove mega breaches due to the bias they introduce: the cost to an organization per record falls as more records are lost, and most breaches do not expose 78 million records.
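As a quick illustration of the mega-breach convention mentioned above, records above a size threshold can be filtered out before aggregating. The sketch below uses toy numbers and a hypothetical one-million-record cutoff, not values from this dataset:

```python
# Illustrative only: exclude "mega breaches" before aggregating.
# Toy numbers and the 1,000,000-record cutoff are assumptions, not dataset values.
import pandas as pd

toy = pd.DataFrame({'Individuals Affected': [500.0, 78_800_000.0, 2_000.0, 1_200_000.0]})

MEGA_THRESHOLD = 1_000_000
regular = toy[toy['Individuals Affected'] < MEGA_THRESHOLD]

print(toy['Individuals Affected'].mean())      # inflated by the mega-breach rows
print(regular['Individuals Affected'].mean())  # 1250.0 -- typical-breach scale
```

A single Anthem-scale row dominates the unfiltered mean, which is exactly the bias the exclusion avoids.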

Lastly, to ensure trends in healthcare data breaches are fully observed in this dataset, the breach type and location of breach should be examined together. They point to the specific attack most often carried out against healthcare providers as a covered entity type: ransomware.

In [ ]:
filtered_df = df[(df['Year of Breach Submission'] != 2009) & (df['Year of Breach Submission'] != 2023)]

# Group by year and covered entity type, and sum the 'Individuals Affected' column
grouped_df = filtered_df.groupby(['Year of Breach Submission', 'Covered Entity Type'])['Individuals Affected'].sum().reset_index()

# Create a bubble chart
fig = px.scatter(grouped_df, 
                 x='Year of Breach Submission', 
                 y='Individuals Affected', 
                 size='Individuals Affected', 
                 color='Covered Entity Type', 
                 title='Cumulative Individuals Affected by Breaches per Year',
                 labels={'Individuals Affected': 'Total Individuals Affected'},
                 size_max=60)

fig.update_layout(
    title=dict(
        text='Healthcare Providers Hit the Most<br><sub>Cumulative Individuals Affected by Breaches per Year per Covered Entity Type</sub>',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=60, b=0),
    width=None,
    height=None
)

fig.show()

In the above annual bubble chart, Anthem once again muddies the waters as a Health Plan entity type, but it is important to observe that, outlier aside, healthcare providers such as hospitals and care facilities have been targeted the most in recent years. This is particularly troubling given the other strains these providers have faced with Covid-19 and overwhelmed staffing. Hospitals in particular are often hit with ransomware attacks.

In [ ]:
# Create a treemap
fig = px.treemap(df, 
                 path=['Covered Entity Type', 'First Breach Type', 'First Location of Breach'], 
                 values='Individuals Affected', 
                 color='Covered Entity Type', 
                 title='Individuals Affected by Breaches: Entity Type, Breach Type, and Location')

fig.update_layout(
    title=dict(
        text='Individuals Affected by Breaches<br><sub>Grouped by Entity Type, Breach Type, and Location</sub>',
        x=0.5,
        y=0.95,
        font=dict(size=18)
    ),
    autosize=True,
    margin=go.layout.Margin(l=0, r=0, t=60, b=0),
    width=None,
    height=None
)

fig.show()

This can be observed in the above treemap. Hacking has cost every healthcare entity type the most, with network servers and email-based phishing attacks behind breaches across the board as well. These are most likely ransomware attacks, which are even more profitable to bad actors than ordinary breaches and costlier to healthcare organizations. Hackers can double dip when they successfully execute ransomware attacks against healthcare organizations: they exfiltrate the data before encrypting it, and then ransom the decryption keys back to providers, who, given the nature of their work, are more willing to pay in order to regain access to their data and provide care to their patients.

Feature Engineering & Time Series Decomposition¶

For most robust statistical analysis and machine learning, all the categorical variables should be encoded in order to be able to be processed by machine learning algorithms that expect all numerical data types.

In [ ]:
df.dtypes
Out[ ]:
Name of Covered Entity                             object
State                                              object
Covered Entity Type                                object
Individuals Affected                              float64
Breach Submission Date                     datetime64[ns]
Type of Breach                                     object
Location of Breached Information                   object
Business Associate Present                         object
Web Description                                    object
Year of Breach Submission                           int32
First Breach Type                                  object
First Location of Breach                           object
Average Cost Per Record Per Breach Year             int64
Estimated Total Breach Cost                       float64
dtype: object
  1. Business Associate Present can be encoded as 0 (no) or 1 (yes).
  2. Covered Entity Type will be dummy encoded into 4 variables.
  3. First Breach Type will be dummy encoded into 7 variables.
  4. The other candidates will not be encoded, due to high dimensionality.
In [ ]:
df['Business Associate Present'] = df['Business Associate Present'].replace({'Yes':1, 'No' :0}).astype(int)
In [ ]:
df = pd.concat([df, pd.get_dummies(df['Covered Entity Type'], prefix="Entity Type").astype(int)], axis =1)
df = pd.concat([df, pd.get_dummies(df['First Breach Type'], prefix="First Breach Type").astype(int)], axis =1)
In [ ]:
df.columns
Out[ ]:
Index(['Name of Covered Entity', 'State', 'Covered Entity Type',
       'Individuals Affected', 'Breach Submission Date', 'Type of Breach',
       'Location of Breached Information', 'Business Associate Present',
       'Web Description', 'Year of Breach Submission', 'First Breach Type',
       'First Location of Breach', 'Average Cost Per Record Per Breach Year',
       'Estimated Total Breach Cost', 'Entity Type_Business Associate',
       'Entity Type_Health Plan', 'Entity Type_Healthcare Clearing House',
       'Entity Type_Healthcare Provider',
       'First Breach Type_Hacking/IT Incident',
       'First Breach Type_Improper Disposal', 'First Breach Type_Loss',
       'First Breach Type_Other', 'First Breach Type_Theft',
       'First Breach Type_Unauthorized Access/Disclosure',
       'First Breach Type_Unknown'],
      dtype='object')

We can now generate a correlation heatmap to view potential relationships among relevant numerical types.

In [ ]:
select_cols = ['Individuals Affected', 'Business Associate Present','Average Cost Per Record Per Breach Year', 'Estimated Total Breach Cost',
               'Entity Type_Health Plan', 'Entity Type_Healthcare Clearing House', 'Entity Type_Healthcare Provider',
               'First Breach Type_Hacking/IT Incident', 'First Breach Type_Improper Disposal', 'First Breach Type_Loss',
               'First Breach Type_Other', 'First Breach Type_Theft', 'First Breach Type_Unauthorized Access/Disclosure',
               'First Breach Type_Unknown']
In [ ]:
df_selected = df[select_cols].copy()
corr_matrix = df_selected.corr(method="pearson")
In [ ]:
fig = ff.create_annotated_heatmap(
    z=corr_matrix.values,
    x=list(corr_matrix.columns),
    y=list(corr_matrix.index),
    annotation_text=corr_matrix.round(2).values,
    showscale=True,
    colorscale='Viridis',
)

fig.update_layout(
    title='Correlation Heatmap',
    xaxis=dict(title='Columns', side='bottom', tickvals=[], ticktext=[]),
    yaxis=dict(title='Columns', tickvals=[], ticktext=[]),
    autosize=True
)

fig.show()

From the correlation heatmap, it was determined that there was very little meaningful correlation between variables that would indicate strong linear predictive ability in the dataset. Estimated Total Breach Cost was generated using individuals affected, so naturally the correlation (0.98) would be high. Although Estimated Total Cost would be very enlightening to predict, it was calculated in a very biased way, and will probably not be utilized for predictive tasks. Hacking and the average cost per record being decently correlated (0.51) is more of a consequence of their similar positive relationships with time. There is correlation between some of the categorical labels that have been encoded, but they were encoded from the same categorical, so this is also not too helpful.

If multi-variable regression were to be utilized, a non-linear regressor would probably be best, in hopes of capturing non-linear relationships that the heatmap cannot reveal. For time series forecasting, it is possible that univariate methods might perform more accurately. The forecasting target variables are essentially the counts of different kinds of breaches, so a time series dataset needs to be partitioned from the existing set for each kind.
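As a rough sketch of the non-linear regression idea (not part of the analysis in this notebook), a tree-based regressor could be fit on the encoded features. Everything below is synthetic stand-in data; the column meanings in the comments are assumptions:

```python
# Sketch: a non-linear (tree-based) regressor on dummy-encoded breach features.
# Synthetic data stands in for the encoded dataset; column roles are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 2, n),     # e.g. 'Business Associate Present'
    rng.integers(0, 2, n),     # e.g. 'First Breach Type_Hacking/IT Incident'
    rng.uniform(100, 400, n),  # e.g. 'Average Cost Per Record Per Breach Year'
])
# Non-linear target: hacking breaches scale differently with cost per record
y = np.where(X[:, 1] == 1, X[:, 2] ** 1.5, X[:, 2]) + rng.normal(0, 10, n)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"validation MAE: {mean_absolute_error(y_val, model.predict(X_val)):.1f}")
```

A random forest captures the interaction between the breach-type indicator and the continuous feature without any manual feature crosses, which is the advantage a non-linear regressor would offer here.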

For time series forecasting, we need to partition and create multiple dataframes according to time frame and frequency. We will initially try monthly and cut out May 2023 for being incomplete.

In [ ]:
df_time_indexed = df[df['Breach Submission Date'] <= '2023-04-30'].copy().set_index('Breach Submission Date')
df_time_indexed_cyber = df[(df['First Breach Type'] == "Hacking/IT Incident") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
df_time_indexed_network = df[(df['First Location of Breach'] == "Network Server") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
df_time_indexed_email = df[(df['First Location of Breach'] == "Email") & (df['Breach Submission Date'] <= '2023-04-30')].copy().set_index('Breach Submission Date')
In [ ]:
monthly_breaches = df_time_indexed.resample('M').size()
monthly_breaches_cyber = df_time_indexed_cyber.resample('M').size()
monthly_breaches_network = df_time_indexed_network.resample('M').size()
monthly_breaches_email = df_time_indexed_email.resample('M').size()

We will want dataframe versions of these series later for easier splitting.

In [ ]:
monthly_breaches_df = monthly_breaches.to_frame(name='Number of Breaches Per Month')
monthly_breaches_cyber_df = monthly_breaches_cyber.to_frame(name='Number of Cyber Breaches Per Month')
monthly_breaches_network_df = monthly_breaches_network.to_frame(name='Number of Network Breaches Per Month')
monthly_breaches_email_df = monthly_breaches_email.to_frame(name='Number of Email Breaches Per Month')
In [ ]:
# 'Breach Submission Date' became the index via to_frame(), so rename the index rather than a column
monthly_breaches_df.index.name = 'Month'
monthly_breaches_cyber_df.index.name = 'Month'
monthly_breaches_network_df.index.name = 'Month'
monthly_breaches_email_df.index.name = 'Month'

After partitioning the dataset into the series of interest, each series can be examined for seasonality, trend, and autocorrelation to make sure it is stationary before modeling. Augmented Dickey–Fuller tests will be performed to verify stationarity.

In [ ]:
decomposition = seasonal_decompose(monthly_breaches, model='additive')
decomposition_cyber = seasonal_decompose(monthly_breaches_cyber, model='additive')
decomposition_network = seasonal_decompose(monthly_breaches_network, model='additive')
decomposition_email = seasonal_decompose(monthly_breaches_email, model='additive')
In [ ]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=monthly_breaches.index,
                         y=decomposition.seasonal,
                         mode='lines',
                         name='Seasonal Component'))

fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Breaches',
                  xaxis_title='Month',
                  yaxis_title='Seasonal Component')

fig.show()
In [ ]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=monthly_breaches_cyber.index,
                         y=decomposition_cyber.seasonal,
                         mode='lines',
                         name='Seasonal Component'))

fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Cyber Breaches',
                  xaxis_title='Month',
                  yaxis_title='Seasonal Component')

fig.show()
In [ ]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=monthly_breaches_network.index,
                         y=decomposition_network.seasonal,
                         mode='lines',
                         name='Seasonal Component'))

fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Network Breaches',
                  xaxis_title='Month',
                  yaxis_title='Seasonal Component')

fig.show()
In [ ]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=monthly_breaches_email.index,
                         y=decomposition_email.seasonal,
                         mode='lines',
                         name='Seasonal Component'))

fig.update_layout(title='Seasonal Decomposition of Time Series For Monthly Email Breaches',
                  xaxis_title='Month',
                  yaxis_title='Seasonal Component')

fig.show()
In [ ]:
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches.rolling(window=12).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches.index, y=monthly_breaches, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))

fig.update_layout(title='Trend Analysis All Breaches', xaxis_title='Time', yaxis_title='Number of Breaches')
fig.show()
In [ ]:
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_cyber.rolling(window=12).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_cyber.index, y=monthly_breaches_cyber, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))

fig.update_layout(title='Trend Analysis Cyber', xaxis_title='Time', yaxis_title='Number of Cyber Breaches')
fig.show()
In [ ]:
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_network.rolling(window=12).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_network.index, y=monthly_breaches_network, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))

fig.update_layout(title='Trend Analysis Network', xaxis_title='Time', yaxis_title='Number of Network Breaches')
fig.show()
In [ ]:
# Calculate the rolling mean with a window size of 12 (yearly trend)
trend = monthly_breaches_email.rolling(window=12).mean()

fig = go.Figure()
fig.add_trace(go.Scatter(x=monthly_breaches_email.index, y=monthly_breaches_email, mode='lines', name='Original Data'))
fig.add_trace(go.Scatter(x=trend.index, y=trend, mode='lines', name='Trend'))

fig.update_layout(title='Trend Analysis Email', xaxis_title='Time', yaxis_title='Number of Email Breaches')
fig.show()
In [ ]:
lags = range(1, 25) 
autocorrelation = [monthly_breaches_df['Number of Breaches Per Month'].autocorr(lag=lag) for lag in lags]


autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})

fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))

fig.update_layout(title='Autocorrelation Analysis For All Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
In [ ]:
lags = range(1, 25) 
autocorrelation = [monthly_breaches_cyber_df['Number of Cyber Breaches Per Month'].autocorr(lag=lag) for lag in lags]


autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})

fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))

fig.update_layout(title='Autocorrelation Analysis For Cyber Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
In [ ]:
lags = range(1, 25) 
autocorrelation = [monthly_breaches_network_df['Number of Network Breaches Per Month'].autocorr(lag=lag) for lag in lags]


autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})

fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))

fig.update_layout(title='Autocorrelation Analysis For Network Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()
In [ ]:
lags = range(1, 25) 
autocorrelation = [monthly_breaches_email_df['Number of Email Breaches Per Month'].autocorr(lag=lag) for lag in lags]


autocorr_data = pd.DataFrame({'Lag': lags, 'Autocorrelation': autocorrelation})

fig = go.Figure()
fig.add_trace(go.Bar(x=autocorr_data['Lag'], y=autocorr_data['Autocorrelation']))

fig.update_layout(title='Autocorrelation Analysis For Email Breaches', xaxis_title='Lag', yaxis_title='Autocorrelation')
fig.show()

A linear trend can be observed in the trend graphs, which is not unexpected given the exploratory analysis. First-order differencing may be all that is required to make the series stationary.
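A minimal sketch of why first-order differencing should suffice: differencing a linearly trending series leaves fluctuations around a constant (the slope). The series here is synthetic, standing in for `monthly_breaches`; pmdarima's `ndiffs` test could also be used to estimate the differencing order automatically.

```python
# First-order differencing removes a linear trend: the differenced series
# fluctuates around a constant (the slope) instead of drifting upward.
# Synthetic 160-point series, mimicking the length of the monthly data.
import numpy as np

rng = np.random.default_rng(42)
series = 0.5 * np.arange(160) + rng.normal(0, 2, 160)  # linear trend + noise

diffed = np.diff(series)                  # first-order differencing
print(round(series[-1] - series[0], 1))   # large drift in the original
print(round(diffed.mean(), 2))            # differenced mean is near the slope (0.5)
```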

In [ ]:
def adf_test(series):
    result = adfuller(series)
    print('ADF Statistic: {}'.format(result[0]))
    print('p-value: {}'.format(result[1]))
    print('Critical Values:')
    for key, value in result[4].items():
        print('\t{}: {}'.format(key, value))

    if result[1] <= 0.05:
        print("The time series is stationary.")
    else:
        print("The time series is non-stationary.")
In [ ]:
adf_test(monthly_breaches)
ADF Statistic: -1.4103206432385045
p-value: 0.5773183932548773
Critical Values:
	1%: -3.472703119504854
	5%: -2.880131672353732
	10%: -2.5766826861130268
The time series is non-stationary.

Because the time series is non-stationary, we should apply first-order differencing to transform it, but first we should also split the data.

For training, we'll use all data up until and through 2020 and then 2021, 2022, & 2023 for validation.

In [ ]:
train_total = monthly_breaches[monthly_breaches.index.year <= 2020]
val_total = monthly_breaches[monthly_breaches.index.year >= 2021]
train_cyber = monthly_breaches_cyber[monthly_breaches_cyber.index.year <= 2020]
val_cyber = monthly_breaches_cyber[monthly_breaches_cyber.index.year >= 2021]
train_network = monthly_breaches_network[monthly_breaches_network.index.year <= 2020]
val_network = monthly_breaches_network[monthly_breaches_network.index.year >= 2021]
train_email = monthly_breaches_email[monthly_breaches_email.index.year <= 2020]
val_email = monthly_breaches_email[monthly_breaches_email.index.year >= 2021]
In [ ]:
total_differenced_series_train = train_total - train_total.shift(1)
total_differenced_series_train.dropna(inplace=True)
adf_test(total_differenced_series_train)
ADF Statistic: -4.405254874250095
p-value: 0.0002905404844863888
Critical Values:
	1%: -3.4851223522012855
	5%: -2.88553750045158
	10%: -2.5795685622144586
The time series is stationary.
In [ ]:
total_differenced_series_val = val_total - val_total.shift(1)
total_differenced_series_val.dropna(inplace=True)
adf_test(total_differenced_series_val)
ADF Statistic: -5.801423150268276
p-value: 4.621409050534281e-07
Critical Values:
	1%: -3.7238633119999998
	5%: -2.98648896
	10%: -2.6328004
The time series is stationary.
In [ ]:
def difference_series(series, lag=1):
    differenced_series = series - series.shift(lag)
    differenced_series.dropna(inplace=True)
    return differenced_series
In [ ]:
cyber_differenced_series_train = difference_series(train_cyber)
cyber_differenced_series_val = difference_series(val_cyber)
network_differenced_series_train = difference_series(train_network)
network_differenced_series_val = difference_series(val_network)
email_differenced_series_train = difference_series(train_email)
email_differenced_series_val = difference_series(val_email)
In [ ]:
adf_test(cyber_differenced_series_train)
adf_test(network_differenced_series_train)
adf_test(email_differenced_series_train)
adf_test(cyber_differenced_series_val)
adf_test(network_differenced_series_val)
adf_test(email_differenced_series_val)
ADF Statistic: -2.908816604131042
p-value: 0.044332310506121776
Critical Values:
	1%: -3.4885349695076844
	5%: -2.887019521656941
	10%: -2.5803597920604915
The time series is stationary.
ADF Statistic: -3.3429397262549263
p-value: 0.013064831155840933
Critical Values:
	1%: -3.485585145896754
	5%: -2.885738566292665
	10%: -2.5796759080663887
The time series is stationary.
ADF Statistic: -3.3384710601041245
p-value: 0.013243988196147794
Critical Values:
	1%: -3.4865346059036564
	5%: -2.8861509858476264
	10%: -2.579896092790057
The time series is stationary.
ADF Statistic: -5.457767771580996
p-value: 2.558808583928109e-06
Critical Values:
	1%: -3.7238633119999998
	5%: -2.98648896
	10%: -2.6328004
The time series is stationary.
ADF Statistic: -5.04763693583765
p-value: 1.784751070709593e-05
Critical Values:
	1%: -3.7238633119999998
	5%: -2.98648896
	10%: -2.6328004
The time series is stationary.
ADF Statistic: -2.3847732934582515
p-value: 0.1460802333957621
Critical Values:
	1%: -3.859073285322359
	5%: -3.0420456927297668
	10%: -2.6609064197530863
The time series is non-stationary.

Now that the relevant time series are fully preprocessed, we can move on to model selection and validation.

Model Selection¶

Because each time series has only about 160 monthly data points, it makes more sense to use traditional forecasting models such as SARIMA or Exponential Smoothing rather than a deep learning model, which would be better suited to much larger, finer-grained time series data.

In [ ]:
def fit_arima_and_forecast_validate(train, val):
    model = pm.auto_arima(train)
    
    pred_train = model.predict_in_sample()
    pred_val = model.predict(n_periods=len(val))
    
    return model, pred_train, pred_val
In [ ]:
def fit_exponential_smoothing_and_forecast_validate(train, val, seasonal_periods=12):
    model = ExponentialSmoothing(train, seasonal_periods=seasonal_periods, seasonal='add', trend='add').fit()
    
    pred_train = model.fittedvalues
    pred_val = model.forecast(steps=len(val))
    
    return model, pred_train, pred_val

Run training and validation for all four series with both models.

In [ ]:
model_total, pred_total_train, pred_total_val = fit_arima_and_forecast_validate(train_total, val_total)
model_cyber, pred_cyber_train, pred_cyber_val = fit_arima_and_forecast_validate(train_cyber, val_cyber)
model_network, pred_network_train, pred_network_val = fit_arima_and_forecast_validate(train_network, val_network)
model_email, pred_email_train, pred_email_val = fit_arima_and_forecast_validate(train_email, val_email)

model_total_es, pred_total_es_train, pred_total_es_val = fit_exponential_smoothing_and_forecast_validate(train_total, val_total)
model_cyber_es, pred_cyber_es_train, pred_cyber_es_val = fit_exponential_smoothing_and_forecast_validate(train_cyber, val_cyber)
model_network_es, pred_network_es_train, pred_network_es_val = fit_exponential_smoothing_and_forecast_validate(train_network, val_network)
model_email_es, pred_email_es_train, pred_email_es_val = fit_exponential_smoothing_and_forecast_validate(train_email, val_email)
In [ ]:
def plot_actual_vs_predicted(train, val, pred_train, pred_val, title):
    # Reverse the differencing transformation
    train_reversed = train + train.iloc[0]
    val_reversed = val + val.iloc[0]
    pred_train_reversed = pred_train + train.iloc[0]
    pred_val_reversed = pred_val + val.iloc[0]
    fig = go.Figure()

    # Add training data (actual and predicted)
    fig.add_trace(go.Scatter(x=train_reversed.index, y=train_reversed, mode='markers', name='Actual Train'))
    fig.add_trace(go.Scatter(x=train_reversed.index, y=pred_train_reversed, mode='lines', name='Predicted Train'))

    # Add validation data (actual and predicted)
    fig.add_trace(go.Scatter(x=val_reversed.index, y=val_reversed, mode='markers', name='Actual Validation'))
    fig.add_trace(go.Scatter(x=val_reversed.index, y=pred_val_reversed, mode='lines', name='Predicted Validation'))

    fig.update_layout(title=title, xaxis_title='Date', yaxis_title='Value', autosize=True)
    
    fig.show()
In [ ]:
# Plot the actual vs predicted values for ARIMA models
plot_actual_vs_predicted(train_total, val_total, pred_total_train, pred_total_val, title='ARIMA - Total')
plot_actual_vs_predicted(train_cyber, val_cyber, pred_cyber_train, pred_cyber_val, title='ARIMA - Cyber')
plot_actual_vs_predicted(train_network, val_network, pred_network_train, pred_network_val, title='ARIMA - Network')
plot_actual_vs_predicted(train_email, val_email, pred_email_train, pred_email_val, title='ARIMA - Email')

# Plot the actual vs predicted values for Exponential Smoothing models
plot_actual_vs_predicted(train_total, val_total, pred_total_es_train, pred_total_es_val, title='Exponential Smoothing - Total')
plot_actual_vs_predicted(train_cyber, val_cyber, pred_cyber_es_train, pred_cyber_es_val, title='Exponential Smoothing - Cyber')
plot_actual_vs_predicted(train_network, val_network, pred_network_es_train, pred_network_es_val, title='Exponential Smoothing - Network')
plot_actual_vs_predicted(train_email, val_email, pred_email_es_train, pred_email_es_val, title='Exponential Smoothing - Email')
In [ ]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def calculate_accuracy_metrics(train, val, pred_train, pred_val, title):
    # Reverse the differencing transformation
    train_reversed = train + train.iloc[0]
    val_reversed = val + val.iloc[0]
    pred_train_reversed = pred_train + train.iloc[0]
    pred_val_reversed = pred_val + val.iloc[0]
    mae_train = mean_absolute_error(train_reversed, pred_train_reversed)
    mse_train = mean_squared_error(train_reversed, pred_train_reversed)
    rmse_train = np.sqrt(mse_train)
    mape_train = mean_absolute_percentage_error(train_reversed, pred_train_reversed)
    mae_val = mean_absolute_error(val_reversed, pred_val_reversed)
    mse_val = mean_squared_error(val_reversed, pred_val_reversed)
    rmse_val = np.sqrt(mse_val)
    mape_val = mean_absolute_percentage_error(val_reversed, pred_val_reversed)
    print(f'{title} Training\nMAE: {mae_train:.2f}, MSE: {mse_train:.2f}, RMSE: {rmse_train:.2f}, MAPE: {mape_train:.2f}')
    print(f'{title} Validation\nMAE: {mae_val:.2f}, MSE: {mse_val:.2f}, RMSE: {rmse_val:.2f}, MAPE: {mape_val:.2f}')
In [ ]:
# print metrics for ARIMA models
calculate_accuracy_metrics(train_total, val_total, pred_total_train, pred_total_val, title='ARIMA - Total')
calculate_accuracy_metrics(train_cyber, val_cyber, pred_cyber_train, pred_cyber_val, title='ARIMA - Cyber')
calculate_accuracy_metrics(train_network, val_network, pred_network_train, pred_network_val, title='ARIMA - Network')
calculate_accuracy_metrics(train_email, val_email, pred_email_train, pred_email_val, title='ARIMA - Email')
ARIMA - Total Training
MAE: 5.87, MSE: 66.53, RMSE: 8.16, MAPE: 21.28
ARIMA - Total Validation
MAE: 14.32, MSE: 325.55, RMSE: 18.04, MAPE: 16.84
ARIMA - Cyber Training
MAE: 4.07, MSE: 57.16, RMSE: 7.56, MAPE: 76.26
ARIMA - Cyber Validation
MAE: 11.14, MSE: 191.71, RMSE: 13.85, MAPE: 18.92
ARIMA - Network Training
MAE: 2.98, MSE: 37.53, RMSE: 6.13, MAPE: 66.61
ARIMA - Network Validation
MAE: 9.46, MSE: 141.42, RMSE: 11.89, MAPE: 20.64
ARIMA - Email Training
MAE: 2.11, MSE: 10.46, RMSE: 3.23, MAPE: 49.58
ARIMA - Email Validation
MAE: 8.54, MSE: 95.53, RMSE: 9.77, MAPE: 31.75
In [ ]:
# print metrics for Exponential Smoothing Models
calculate_accuracy_metrics(train_total, val_total, pred_total_es_train, pred_total_es_val, title='Exponential Smoothing - Total')
calculate_accuracy_metrics(train_cyber, val_cyber, pred_cyber_es_train, pred_cyber_es_val, title='Exponential Smoothing - Cyber')
calculate_accuracy_metrics(train_network, val_network, pred_network_es_train, pred_network_es_val, title='Exponential Smoothing - Network')
calculate_accuracy_metrics(train_email, val_email, pred_email_es_train, pred_email_es_val, title='Exponential Smoothing - Email')
Exponential Smoothing - Total Training
MAE: 5.81, MSE: 69.81, RMSE: 8.36, MAPE: 20.67
Exponential Smoothing - Total Validation
MAE: 14.12, MSE: 301.53, RMSE: 17.36, MAPE: 16.60
Exponential Smoothing - Cyber Training
MAE: 4.11, MSE: 57.29, RMSE: 7.57, MAPE: 59.60
Exponential Smoothing - Cyber Validation
MAE: 18.68, MSE: 530.12, RMSE: 23.02, MAPE: 31.01
Exponential Smoothing - Network Training
MAE: 3.20, MSE: 38.07, RMSE: 6.17, MAPE: 66.11
Exponential Smoothing - Network Validation
MAE: 8.63, MSE: 100.25, RMSE: 10.01, MAPE: 20.93
Exponential Smoothing - Email Training
MAE: 2.11, MSE: 10.34, RMSE: 3.22, MAPE: 41.64
Exponential Smoothing - Email Validation
MAE: 12.69, MSE: 196.98, RMSE: 14.03, MAPE: 46.51
In [ ]:
models = ['ARIMA Training', 'Exponential Smoothing Training']
mae = [5.87, 5.81]
mse = [66.53, 69.81]
rmse = [8.16, 8.36]
mape = [21.28, 20.67]

trace1 = go.Bar(x=models, y=mae, name='MAE', text=mae, textposition='auto')
trace2 = go.Bar(x=models, y=mse, name='MSE', text=mse, textposition='auto')
trace3 = go.Bar(x=models, y=rmse, name='RMSE', text=rmse, textposition='auto')
trace4 = go.Bar(x=models, y=mape, name='MAPE', text=mape, textposition='auto')

layout = go.Layout(
    title='Training Total Breaches Model Comparison',
    barmode='group'
)

fig = go.Figure(data=[trace1, trace2, trace3, trace4], layout=layout)

fig.show()
In [ ]:
models = ['ARIMA Validation', 'Exponential Smoothing Validation']
mae_val = [16.25, 16.11]
mse_val = [485.49, 469.57]
rmse_val = [22.03, 21.67]
mape_val = [22.65, 22.55]

trace1 = go.Bar(x=models, y=mae_val, name='MAE', text=mae_val, textposition='auto')
trace2 = go.Bar(x=models, y=mse_val, name='MSE', text=mse_val, textposition='auto')
trace3 = go.Bar(x=models, y=rmse_val, name='RMSE', text=rmse_val, textposition='auto')
trace4 = go.Bar(x=models, y=mape_val, name='MAPE', text=mape_val, textposition='auto')

layout_val = go.Layout(
    title='Model Comparison for Validation',
    barmode='group'
)

fig_val = go.Figure(data=[trace1, trace2, trace3, trace4], layout=layout_val)

fig_val.show()

Ultimately, I am selecting Exponential Smoothing because it has a slightly better validation RMSE. I am less concerned with precise accuracy than with generally close counts, and the two models behaved so similarly that their performance differences may well be negligible.
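This selection can also be made programmatically; a small sketch using the validation RMSE values from the comparison charts above:

```python
# Pick the model with the lower validation RMSE
# (values taken from the validation comparison above)
val_rmse = {"ARIMA": 22.03, "Exponential Smoothing": 21.67}
best_model = min(val_rmse, key=val_rmse.get)
print(best_model)  # Exponential Smoothing
```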

Model Forecasting¶

Define pipeline functions

In [ ]:
def exponential_smoothing_forecast_pipeline(entire_time_series, desired_periods=20):
    # Apply the same differencing transformation developed during testing
    differenced_series = difference_series(entire_time_series)
    model = ExponentialSmoothing(differenced_series, seasonal_periods=12, seasonal='add', trend='add').fit()
    predictions = model.forecast(steps=desired_periods)

    # Reverse the differencing, this time across the number of forecast periods:
    # add back the corresponding historical level, or, once the horizon exceeds
    # the available history, a previously reconstructed forecast value
    for i in range(desired_periods):
        if i < len(entire_time_series):
            predictions.iloc[i] += entire_time_series.iloc[-(i+1)]
        else:
            predictions.iloc[i] += predictions.iloc[i - len(entire_time_series)]

    # Round down to the nearest integer to avoid counting partial breaches
    predictions = np.floor(predictions)

    return predictions
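The pipeline has to undo the transformation applied by `difference_series` (defined earlier in the notebook). As a toy sanity check of the general idea, assuming a simple lag-1 difference, forecast levels can be reconstructed by cumulatively summing the forecast differences from the last observed level; the `future_diffs` values here are hypothetical:

```python
import pandas as pd

# Toy series and its lag-1 differences
series = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])
diffed = series.diff().dropna()

# Round-trip check: cumulatively summing the differences recovers the original
restored = diffed.cumsum() + series.iloc[0]
assert restored.tolist() == series.iloc[1:].tolist()

# Hypothetical forecast differences for the next three periods
future_diffs = pd.Series([2.0, -1.0, 3.0])

# Invert the differencing: each level is the previous level plus the next difference
levels = future_diffs.cumsum() + series.iloc[-1]
print(levels.tolist())  # [20.0, 19.0, 22.0]
```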
In [ ]:
total_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches)
cyber_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_cyber)
network_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_network)
email_breaches_forecasted_through_to_2025 = exponential_smoothing_forecast_pipeline(monthly_breaches_email)
In [ ]:
total_breaches_forecasted_through_to_2025
Out[ ]:
2023-05-31    41.0
2023-06-30    63.0
2023-07-31    48.0
2023-08-31    34.0
2023-09-30    50.0
2023-10-31    53.0
2023-11-30    72.0
2023-12-31    67.0
2024-01-31    50.0
2024-02-29    68.0
2024-03-31    79.0
2024-04-30    74.0
2024-05-31    54.0
2024-06-30    46.0
2024-07-31    50.0
2024-08-31    46.0
2024-09-30    64.0
2024-10-31    69.0
2024-11-30    57.0
2024-12-31    46.0
Freq: M, dtype: float64
In [ ]:
cyber_breaches_forecasted_through_to_2025
Out[ ]:
2023-05-31    30.0
2023-06-30    48.0
2023-07-31    36.0
2023-08-31    19.0
2023-09-30    37.0
2023-10-31    40.0
2023-11-30    48.0
2023-12-31    58.0
2024-01-31    36.0
2024-02-29    59.0
2024-03-31    65.0
2024-04-30    54.0
2024-05-31    41.0
2024-06-30    43.0
2024-07-31    43.0
2024-08-31    37.0
2024-09-30    54.0
2024-10-31    52.0
2024-11-30    44.0
2024-12-31    35.0
Freq: M, dtype: float64
In [ ]:
network_breaches_forecasted_through_to_2025
Out[ ]:
2023-05-31    27.0
2023-06-30    42.0
2023-07-31    28.0
2023-08-31    19.0
2023-09-30    28.0
2023-10-31    35.0
2023-11-30    46.0
2023-12-31    46.0
2024-01-31    28.0
2024-02-29    38.0
2024-03-31    42.0
2024-04-30    25.0
2024-05-31    25.0
2024-06-30    33.0
2024-07-31    26.0
2024-08-31    26.0
2024-09-30    39.0
2024-10-31    40.0
2024-11-30    27.0
2024-12-31    22.0
Freq: M, dtype: float64
In [ ]:
email_breaches_forecasted_through_to_2025
Out[ ]:
2023-05-31     6.0
2023-06-30    13.0
2023-07-31    10.0
2023-08-31     6.0
2023-09-30    12.0
2023-10-31    11.0
2023-11-30    14.0
2023-12-31     8.0
2024-01-31    12.0
2024-02-29    15.0
2024-03-31    11.0
2024-04-30    24.0
2024-05-31    12.0
2024-06-30    10.0
2024-07-31    18.0
2024-08-31    12.0
2024-09-30    16.0
2024-10-31    16.0
2024-11-30    18.0
2024-12-31    13.0
Freq: M, dtype: float64
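Before plotting, it may help to roll the monthly forecast up into calendar-year totals; a small sketch using the total-breach values from the table above (note that 2023 covers only May through December):

```python
import pandas as pd

# Forecasted monthly totals transcribed from the table above
idx = pd.date_range("2023-05-01", periods=20, freq="MS")
vals = [41, 63, 48, 34, 50, 53, 72, 67,
        50, 68, 79, 74, 54, 46, 50, 46, 64, 69, 57, 46]
forecast = pd.Series(vals, index=idx, dtype=float)

# Sum by calendar year
yearly = forecast.groupby(forecast.index.year).sum()
print(yearly)  # 2023: 428.0 (May-Dec only), 2024: 703.0
```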

Results¶

In [ ]:
# Create traces
trace_total = go.Scatter(
    x = total_breaches_forecasted_through_to_2025.index,
    y = total_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Total Breaches'
)

trace_cyber = go.Scatter(
    x = cyber_breaches_forecasted_through_to_2025.index,
    y = cyber_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Cyber Breaches'
)

trace_network = go.Scatter(
    x = network_breaches_forecasted_through_to_2025.index,
    y = network_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Network Breaches'
)

trace_email = go.Scatter(
    x = email_breaches_forecasted_through_to_2025.index,
    y = email_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Email Breaches'
)

data = [trace_total, trace_cyber, trace_network, trace_email]

# Define layout
layout = go.Layout(
    title = 'Breach Type Forecast Comparison (2023-2025)',
    xaxis = dict(title = 'Date'),
    yaxis = dict(title = 'Number of Breaches')
)

fig = go.Figure(data=data, layout=layout)
fig.show()

The models predict that breaches will continue to rise, although not dramatically; cyber breaches, however, appear to be overestimated.

When applying the historical dataset's estimated average breach cost to the forecasted counts, an estimate that may well be conservative, breaches could cost the healthcare industry roughly 9 billion dollars over the next year and a half.

In [ ]:
forecast_total_breach_cost_estimation = total_breaches_forecasted_through_to_2025.sum() * df['Estimated Total Breach Cost'].mean()
In [ ]:
print(f'${forecast_total_breach_cost_estimation:,.2f}')
$9,030,055,980.31
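As a quick sanity check on this figure, the implied average cost per breach can be backed out of the printed estimate; the 1,131 figure here is the sum of the 20 forecasted monthly totals shown above:

```python
# Implied average cost per breach, derived from the printed estimate
forecast_breach_count = 1131  # sum of the forecasted monthly totals above
implied_cost_per_breach = 9_030_055_980.31 / forecast_breach_count
print(f"${implied_cost_per_breach:,.2f}")  # roughly $7.98M per breach
```

This is broadly in line with IBM's reported average cost of a healthcare data breach, which lends some plausibility to the aggregate estimate.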

To examine the whole story, the historical and forecasted data were plotted together.

In [ ]:
trace_total_hist = go.Scatter(
    x = monthly_breaches.index,
    y = monthly_breaches,
    mode = 'lines',
    name = 'Total Breaches - Historical',
    line = dict(color='darkblue')
)

trace_cyber_hist = go.Scatter(
    x = monthly_breaches_cyber.index,
    y = monthly_breaches_cyber,
    mode = 'lines',
    name = 'Cyber Breaches - Historical',
    line = dict(color='darkred')
)

trace_network_hist = go.Scatter(
    x = monthly_breaches_network.index,
    y = monthly_breaches_network,
    mode = 'lines',
    name = 'Network Breaches - Historical',
    line = dict(color='darkgreen')
)

trace_email_hist = go.Scatter(
    x = monthly_breaches_email.index,
    y = monthly_breaches_email,
    mode = 'lines',
    name = 'Email Breaches - Historical',
    line = dict(color='darkorange')
)

# Create traces for forecasted data
trace_total_forecast = go.Scatter(
    x = total_breaches_forecasted_through_to_2025.index,
    y = total_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Total Breaches - Forecasted',
    line = dict(color='lightblue')
)

trace_cyber_forecast = go.Scatter(
    x = cyber_breaches_forecasted_through_to_2025.index,
    y = cyber_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Cyber Breaches - Forecasted',
    line = dict(color='salmon')
)

trace_network_forecast = go.Scatter(
    x = network_breaches_forecasted_through_to_2025.index,
    y = network_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Network Breaches - Forecasted',
    line = dict(color='lightgreen')
)

trace_email_forecast = go.Scatter(
    x = email_breaches_forecasted_through_to_2025.index,
    y = email_breaches_forecasted_through_to_2025,
    mode = 'lines',
    name = 'Email Breaches - Forecasted',
    line = dict(color='orange')
)

data = [trace_total_hist, trace_cyber_hist, trace_network_hist, trace_email_hist,
        trace_total_forecast, trace_cyber_forecast, trace_network_forecast, trace_email_forecast]

layout = go.Layout(
    title = 'Breach Type Comparison (Historical (dark) and Forecasted (light))',
    xaxis = dict(title = 'Date'),
    yaxis = dict(title = 'Number of Breaches')
)

fig = go.Figure(data=data, layout=layout)
fig.show()

Looking at 2018 through 2025, we can see the historical cutoff just before mid-2023, where we would expect it today. The lighter colors represent forecasted values relative to their darker historical counterparts. Rewinding to the last quarter of 2020, we can see the breaches carried out while healthcare was absorbing the shock of Covid. It was a perfect storm for healthcare providers: remote work became necessary very quickly for as much of the staff as possible, opening holes in their cybersecurity, while Covid simultaneously drove an influx of work and patients. This matters for the modeling because such noise could have heavily inflated future predictions, yet the models did not appear to be dramatically affected by it. Unless something as disruptive as Covid occurs again, another spike of that magnitude should not be expected. Even so, breaches should be expected to become more frequent, with cyber breaches accounting for the majority of them.

Sources¶

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7349636/ - Previous forecasting attempts
https://ocrportal.hhs.gov/ocr/breach/breach_report.jsf - Breach dataset
https://www.hhs.gov/hipaa/for-professionals/privacy/guidance/business-associates/index.html - Covered entity types table
https://www.ibm.com/downloads/cas/3R8N1DZJ - Used to assist with imputing average cost per record
https://search.r-project.org/CRAN/refmans/Ecdat/html/HHSCyberSecurityBreaches.html - Used to fill in the missing breach type for one record
https://www.healthdatamanagement.com/articles/tax-fraud-fueled-breach-hits-36th-fire-department - Used to fill in the missing Individuals Affected for one record
https://www.ibm.com/downloads/cas/OJDVQGRY - Source for US dollar influence on record cost
https://www.experian.com/blogs/ask-experian/heres-how-much-your-personal-information-is-selling-for-on-the-dark-web/ - Used to reinforce the significant value of healthcare records on the dark web